The Hungarian National Corpus
Abstract
The paper reports on the development of the Hungarian National Corpus, which was completed at the end of 2001 after four years’ effort. The HNC is designed to be a balanced reference corpus of current written Hungarian consisting of 150 million words. The paper first discusses basic design issues concerning the composition of the corpus. The HNC adopts a fairly pragmatic approach, focusing on five major text types. The second half of the paper contains details of the annotation and tagging system used.
Related resources
The Hungarian Gigaword Corpus
The paper reports on the development of the Hungarian Gigaword Corpus, an extended new edition of the Hungarian National Corpus, with upgraded and redesigned linguistic annotation and an increased size of 1.5 billion tokens. Issues concerning the standard steps of corpus collection and preparation are discussed with special emphasis on linguistic analysis and annotation due to Hungarian having ...
Hungarian Word-Sense Disambiguated Corpus
To create the first Hungarian WSD corpus, 39 suitable word form samples were selected for the purpose of word sense disambiguation. Among other selection criteria, the given word form had to be frequent in Hungarian language usage (frequency rates from the Hungarian National Corpus (HNC) were used for measurement; Váradi, 2000) and to have more than one sense considered frequent...
Morpho-syntactic ambiguity and tagset design for Hungarian
The paper reports on work in progress to develop a tag set for Hungarian. The rich morphological structure of the language makes tagging feasible only after a full-scale morphological analysis, which results in a magnitude of patterns that do not easily translate into a corpus tag set of manageable size. The paper analyses the extent and types of morpho-syntactic ambiguity found in a 21m word s...
The Szeged Corpus: A POS Tagged and Syntactically Annotated Hungarian Natural Language Corpus
The Szeged Corpus is a manually annotated natural language corpus currently comprising 1.2 million word entries, 145 thousand different word forms, and an additional 225 thousand punctuation marks. With this, it is the largest manually processed Hungarian textual database that serves as a reference material for research in natural language processing as well as a learning database for machine l...
A Hungarian Sentiment Corpus Manually Annotated at Aspect Level
In this paper we present a Hungarian sentiment corpus manually annotated at aspect level. Our corpus consists of Hungarian opinion texts written about different types of products. The main aim of creating the corpus was to produce an appropriate database providing possibilities for developing text mining software tools. The corpus is a unique Hungarian database: to the best of our knowledge, no...